
    Q-Prop: Sample-efficient policy gradient with an off-policy critic

    Model-free deep reinforcement learning (RL) methods have been successful in a wide variety of simulated domains. However, a major obstacle facing deep RL in the real world is its high sample complexity. Batch policy gradient methods offer stable learning, but at the cost of high variance, which often requires large batches. TD-style methods, such as off-policy actor-critic and Q-learning, are more sample-efficient but biased, and often require costly hyperparameter sweeps to stabilize. In this work, we aim to develop methods that combine the stability of policy gradients with the efficiency of off-policy RL. We present Q-Prop, a policy gradient method that uses a Taylor expansion of the off-policy critic as a control variate. Q-Prop is both sample-efficient and stable, and effectively combines the benefits of on-policy and off-policy methods. We analyze the connection between Q-Prop and existing model-free algorithms, and use control variate theory to derive two variants of Q-Prop with conservative and aggressive adaptation. We show that conservative Q-Prop provides substantial gains in sample efficiency over trust region policy optimization (TRPO) with generalized advantage estimation (GAE), and improves stability over deep deterministic policy gradient (DDPG), the state-of-the-art on-policy and off-policy methods respectively, on OpenAI Gym's MuJoCo continuous control environments.
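
    As a sketch of the core idea, the first-order Taylor expansion of the critic around the deterministic policy mean gives an analytic control variate that is subtracted from the Monte Carlo advantage and added back through the critic's gradient. The snippet below is a minimal, hedged illustration of that centring step, not the authors' implementation; the scalar-action shapes, the per-sample covariance stand-in, and the function name are assumptions.

```python
# Minimal sketch of the Q-Prop centring step (conservative variant), assuming
# scalar actions and placeholder arrays; not the authors' implementation.
import numpy as np

def qprop_centred_advantage(adv, actions, mu, q_grad_at_mu, conservative=True):
    """adv          : Monte Carlo advantage estimates A_hat(s, a), shape (N,)
       actions      : sampled actions a, shape (N,)
       mu           : deterministic policy mean mu(s), shape (N,)
       q_grad_at_mu : dQ_w(s, a)/da evaluated at a = mu(s), shape (N,)"""
    # First-order Taylor expansion of the critic around mu(s):
    # A_bar(s, a) = grad_a Q_w(s, mu(s)) * (a - mu(s))
    a_bar = q_grad_at_mu * (actions - mu)
    # Single-sample stand-in for Cov(A_hat, A_bar): conservative Q-Prop keeps
    # the control variate only where it correlates positively with A_hat;
    # the aggressive variant uses the sign instead.
    if conservative:
        eta = (adv * a_bar > 0).astype(np.float64)
    else:
        eta = np.sign(adv * a_bar)
    # The likelihood-ratio gradient then uses (A_hat - eta * A_bar); the
    # analytic term eta * grad_a Q_w * grad_theta mu(s) is added back separately.
    return adv - eta * a_bar, eta
```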

    Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning

    Off-policy model-free deep reinforcement learning methods using previously collected data can improve sample efficiency over on-policy policy gradient techniques. On the other hand, on-policy algorithms are often more stable and easier to use. This paper examines, both theoretically and empirically, approaches to merging on- and off-policy updates for deep reinforcement learning. Theoretical results show that off-policy updates with a value function estimator can be interpolated with on-policy policy gradient updates whilst still satisfying performance bounds. Our analysis uses control variate methods to produce a family of policy gradient algorithms, with several recently proposed algorithms being special cases of this family. We then provide an empirical comparison of these techniques with the remaining algorithmic details fixed, and show how different mixing of off-policy gradient estimates with on-policy samples contributes to improvements in empirical performance. The final algorithm provides a generalization and unification of existing deep policy gradient techniques, has theoretical guarantees on the bias introduced by off-policy updates, and improves on state-of-the-art model-free deep RL methods on a number of OpenAI Gym continuous control benchmarks.
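
    A minimal sketch of the interpolation the paper analyzes: a likelihood-ratio (on-policy) gradient estimate is convexly combined with a critic-based (off-policy) estimate through a mixing coefficient, here called nu. The function, shapes, and names are illustrative assumptions, not the paper's exact estimator.

```python
# Minimal sketch of interpolating on- and off-policy gradient estimates with a
# mixing coefficient nu; shapes and names are illustrative assumptions.
import numpy as np

def interpolated_gradient(logp_grads, advantages, critic_grads, nu=0.2):
    """logp_grads   : grad_theta log pi(a|s) per sample, shape (N, D)
       advantages   : advantage estimates A_hat(s, a), shape (N,)
       critic_grads : critic-based off-policy gradient per sample, shape (N, D)
       nu           : mixing coefficient in [0, 1]."""
    assert 0.0 <= nu <= 1.0
    on_policy = (logp_grads * advantages[:, None]).mean(axis=0)  # likelihood-ratio term
    off_policy = critic_grads.mean(axis=0)                       # critic-based term
    # nu = 0 recovers the pure on-policy policy gradient;
    # nu = 1 recovers a pure off-policy actor-critic update.
    return (1.0 - nu) * on_policy + nu * off_policy
```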

    High fidelity progressive reinforcement learning for agile maneuvering UAVs

    In this work, we present a high-fidelity, model-based progressive reinforcement learning method for control system design for an agile maneuvering UAV. Our work relies on a simulation-based training and testing environment for software-in-the-loop (SIL), hardware-in-the-loop (HIL), and integrated flight testing within a photo-realistic virtual reality (VR) environment. Through progressive learning with high-fidelity agent and environment models, the guidance and control policies build agile maneuvering capability on top of fundamental control laws. First, we provide insight into the development of high-fidelity mathematical models using frequency-domain system identification. These models are then used to design reinforcement learning-based adaptive flight control laws that allow the vehicle to be controlled over a wide range of operating conditions, covering changes such as payload, battery voltage, and damage to actuators and electronic speed controllers (ESCs). Finally, we design the outer-loop flight guidance and control laws. Our current work and progress are summarized here.
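
    As a hedged illustration of the frequency-domain system identification step mentioned above, the sketch below fits a first-order transfer function G(s) = K/(tau*s + 1) to a synthetic frequency response by complex least squares. The data, model order, and parameter values are placeholder assumptions, not the paper's vehicle models.

```python
# Minimal sketch of frequency-domain system identification: fit a first-order
# transfer function G(s) = K / (tau*s + 1) to measured frequency-response data.
# The "measured" response below is synthetic, not data from the paper.
import numpy as np
from scipy.optimize import least_squares

w = np.logspace(-1, 2, 50)                     # excitation frequencies (rad/s)
true_K, true_tau = 2.0, 0.5
H_meas = true_K / (1j * w * true_tau + 1)      # stand-in for measured response
H_meas = H_meas + 0.01 * (np.random.randn(50) + 1j * np.random.randn(50))

def residual(p):
    K, tau = p
    H = K / (1j * w * tau + 1)
    err = H - H_meas
    return np.concatenate([err.real, err.imag])  # least_squares needs real residuals

fit = least_squares(residual, x0=[1.0, 1.0])
print("K = %.3f, tau = %.3f" % tuple(fit.x))
```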

    vrAIn: a deep learning approach tailoring computing and radio resources in virtualized RANs

    Proceedings of the 25th Annual International Conference on Mobile Computing and Networking (MobiCom '19), October 21-25, 2019, Los Cabos, Mexico. The virtualization of radio access networks (vRAN) is the latest milestone in the NFV revolution. However, the complex dependencies between computing and radio resources make vRAN resource control particularly daunting. We present vrAIn, a dynamic resource controller for vRANs based on deep reinforcement learning. First, we use an autoencoder to project high-dimensional context data (traffic and signal quality patterns) into a latent representation. Then, we use a deep deterministic policy gradient (DDPG) algorithm based on an actor-critic neural network structure and a classifier to map (encoded) contexts into resource control decisions. We have implemented vrAIn using an open-source LTE stack over different platforms. Our results show that vrAIn successfully derives appropriate compute and radio control actions irrespective of the platform and context: (i) it provides savings in computational capacity of up to 30% over CPU-unaware methods; (ii) it improves the probability of meeting QoS targets by 25% over static allocation policies using similar CPU resources on average; (iii) upon CPU capacity shortage, it improves throughput performance by 25% over state-of-the-art schemes; and (iv) it performs close to optimal policies derived from an offline oracle. To the best of our knowledge, this is the first work that thoroughly studies the computational behavior of vRANs, and the first approach to a model-free solution that does not need to assume any particular vRAN platform or system conditions.
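
    The pipeline described above (an autoencoder compressing high-dimensional context, feeding a DDPG-style actor that emits resource decisions) can be sketched structurally as follows. The layer sizes, dimensions, and variable names are illustrative assumptions rather than the published architecture.

```python
# Structural sketch of the vrAIn pipeline: an autoencoder compresses context
# (traffic / signal-quality patterns) into a latent vector, and a DDPG-style
# actor maps the latent context to CPU and radio resource decisions.
# All sizes below are assumptions, not the paper's configuration.
import torch
import torch.nn as nn

CTX_DIM, LATENT_DIM, ACTION_DIM = 256, 16, 2   # assumed dimensions

encoder = nn.Sequential(nn.Linear(CTX_DIM, 64), nn.ReLU(), nn.Linear(64, LATENT_DIM))
decoder = nn.Sequential(nn.Linear(LATENT_DIM, 64), nn.ReLU(), nn.Linear(64, CTX_DIM))

# Actor: latent context -> normalized (CPU share, radio scheduling) decisions
actor = nn.Sequential(nn.Linear(LATENT_DIM, 32), nn.ReLU(),
                      nn.Linear(32, ACTION_DIM), nn.Sigmoid())

ctx = torch.randn(8, CTX_DIM)                           # batch of context observations
z = encoder(ctx)                                        # latent representation
recon_loss = nn.functional.mse_loss(decoder(z), ctx)    # autoencoder objective
actions = actor(z.detach())                             # resource control decisions
```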

    Hemophilia gene therapy knowledge and perceptions: Results of an international survey

    Background: Hemophilia gene therapy is a rapidly evolving therapeutic approach in which a number of programs are approaching clinical development completion. Objective: The aim of this study was to evaluate the knowledge and perceptions of a variety of health care practitioners and scientists about gene therapy for hemophilia. Methods: This survey study was conducted February 1 to 18, 2019. Survey participants were members of the ISTH, European Hemophilia Consortium, European Hematology Association, or European Association for Haemophilia and Allied Disorders with valid email contacts. The online survey consisted of 36 questions covering demographic information, perceptions and knowledge of gene therapy for hemophilia, and educational preferences. Survey results were summarized using descriptive statistics. Results: Of the 5117 survey recipients, 201 responded from 55 countries (4% response rate). Most respondents (66%) were physicians, and 59% were physicians directly involved in the care of people with hemophilia. Among physician respondents directly involved in hemophilia care, 35% lacked the ability to explain the science of adeno-associated viral gene therapy for hemophilia, and 40% indicated limited ability or lack of comfort answering patient questions about gene therapy for hemophilia based on clinical trial results to date. Overall, 75% of survey respondents answered 10 single-answer knowledge questions correctly, 13% incorrectly, and 12% were unsure of the correct answers. Conclusions: This survey highlighted knowledge gaps and educational needs related to gene therapy for hemophilia and, along with other inputs, has informed the development of "Gene Therapy in Hemophilia: An ISTH Education Initiative."

    Mastering the game of Go without human knowledge

    A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo’s own move selections and also the winner of AlphaGo’s games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.
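
    The training signal described here combines a value regression toward the self-play game outcome with a policy term that matches the network's move probabilities to the search-improved visit distribution. The sketch below implements that published combined loss in PyTorch as a hedged illustration; tensor shapes and the helper name are assumptions.

```python
# Minimal sketch of the AlphaGo Zero training objective: the network predicts
# move probabilities p and a value v, regressed toward the MCTS visit
# distribution pi and the self-play outcome z. Tensors are placeholders.
import torch
import torch.nn.functional as F

def alphago_zero_loss(p_logits, v, pi, z, params, c=1e-4):
    """p_logits : raw move logits from the network, shape (B, moves)
       v        : predicted game outcome in [-1, 1], shape (B,)
       pi       : MCTS-improved move distribution, shape (B, moves)
       z        : actual self-play outcome (+1 win / -1 loss), shape (B,)"""
    value_loss = F.mse_loss(v, z)                                       # (z - v)^2
    policy_loss = -(pi * F.log_softmax(p_logits, dim=1)).sum(1).mean()  # -pi^T log p
    l2 = c * sum((w ** 2).sum() for w in params)                        # weight decay
    return value_loss + policy_loss + l2
```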

    Developing a multivariable prediction model for functional outcome after reperfusion therapy for acute ischaemic stroke: study protocol for the Targeting Optimal Thrombolysis Outcomes (TOTO) multicentre cohort study.

    INTRODUCTION: Intravenous thrombolysis (IVT) with recombinant tissue plasminogen activator (rt-PA) is the only approved pharmacological reperfusion therapy for acute ischaemic stroke. Despite population-level benefit, IVT is not equally effective in all patients, nor is it without significant risk. Uncertain treatment outcome prediction complicates patient selection for treatment. This study will develop and validate predictive algorithms for IVT response, using clinical, radiological and blood-based biomarker measures. A secondary objective is to develop predictive algorithms for endovascular thrombectomy (EVT), which has been proven as an effective reperfusion therapy since study inception. METHODS AND ANALYSIS: The Targeting Optimal Thrombolysis Outcomes Study is a multicentre prospective cohort study of ischaemic stroke patients treated at participating Australian stroke centres with IVT and/or EVT. Patients undergo neuroimaging using multimodal CT or MRI at baseline, with repeat neuroimaging 24 hours post-treatment. Baseline and follow-up blood samples are provided for research use. The primary outcome is good functional outcome at 90 days post-stroke, defined as a modified Rankin Scale (mRS) score of 0-2. Secondary outcomes are reperfusion, recanalisation, infarct core growth, change in stroke severity, poor functional outcome, excellent functional outcome, and ordinal mRS at 90 days. Primary predictive models will be developed and validated in patients treated only with rt-PA. Models will be built using regression methods and will include clinical variables, radiological measures from multimodal neuroimaging, and blood-based biomarkers measured by mass spectrometry. Predictive accuracy will be quantified using c-statistics and R². In secondary analyses, models will be developed in patients treated using EVT, with or without prior IVT, reflecting practice changes since the original study design. ETHICS AND DISSEMINATION: Patients, or relatives when patients cannot consent, provide written informed consent to participate. This study received approval from the Hunter New England Local Health District Human Research Ethics Committee (reference 14/10/15/4.02). Findings will be disseminated via peer-reviewed publications and conference presentations.
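
    As a hedged sketch of the kind of model the protocol describes (a regression model for the dichotomised mRS 0-2 outcome, evaluated with the c-statistic), the snippet below fits a logistic regression on synthetic placeholder data; none of the variables correspond to actual study covariates.

```python
# Minimal sketch: logistic regression for good functional outcome (mRS 0-2)
# evaluated with the c-statistic (ROC AUC). Data are synthetic placeholders,
# not study variables.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))     # stand-ins for clinical/imaging covariates
y = (X @ np.array([0.8, -1.2, -0.6, 0.3, 0.0]) + rng.normal(size=500) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)
c_stat = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])  # c-statistic
print(f"c-statistic: {c_stat:.3f}")
```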